Riffled Independence for Efficient Inference with Partial Rankings
Authors: Huang, Kapoor & Guestrin
Abstract
Distributions over rankings are used to model data in a multitude of real-world settings such as preference analysis and political elections. Modeling such distributions presents several computational challenges, however, due to the factorial size of the set of rankings over an item set. Some of these challenges are quite familiar to the artificial intelligence community, such as how to compactly represent a distribution over a combinatorially large space, and how to efficiently perform probabilistic inference with these representations. With respect to ranking, however, there is the additional challenge of what we refer to as human task complexity — users are rarely willing to provide a full ranking over a long list of candidates, instead often preferring to provide partial ranking information. Simultaneously addressing all of these challenges — i.e., designing a compactly representable model which is amenable to efficient inference and can be learned using partial ranking data — is a difficult task, but is necessary if we would like to scale to problems with nontrivial size. In this paper, we show that the recently proposed riffled independence assumptions cleanly and efficiently address each of the above challenges. In particular, we establish a tight mathematical connection between the concepts of riffled independence and of partial rankings. This correspondence not only allows us to then develop efficient and exact algorithms for performing inference tasks using riffled independence based representations with partial rankings, but somewhat surprisingly, also shows that efficient inference is not possible for riffle independent models (in a certain sense) with observations which do not take the form of partial rankings. Finally, using our inference algorithm, we introduce the first method for learning riffled independence based models from partially ranked data.

1. Probabilistic Modeling of Ranking Data: Three Challenges

Rankings arise in a number of machine learning application settings such as preference analysis for movies and books (Lebanon & Mao, 2008) and political election analysis (Gormley & Murphy, 2007; Huang & Guestrin, 2010). In many of these problems, it is of great interest to build statistical models over ranking data in order to make predictions, form recommendations, discover latent trends and structure, and construct human-comprehensible data summaries.

Modeling distributions over rankings is a difficult problem, however, due to the fact that as the number of items being ranked increases, the number of possible rankings increases factorially. This combinatorial explosion forces us to confront three central challenges when dealing with rankings. First, we need to deal with storage complexity: how can we compactly represent a distribution over the space of rankings?¹ Then there is algorithmic complexity: how can we efficiently answer probabilistic inference queries given a distribution? Finally, we must contend with what we refer to as human task complexity, a challenge stemming from the fact that it can be difficult to accurately elicit a full ranking over a large list of candidates from a human user; choosing from a list of n! options is no easy task, and users typically prefer to provide partial information. Take the American Psychological Association (APA) elections, for example, which allow voters to rank-order candidates from favorite to least favorite. In the 1980 election, there were five candidates, and therefore 5! = 120 ways to rank those five candidates. Despite the small candidate list, most voters in the election preferred to specify only their top-k favorite candidates rather than writing down full rankings on their ballots (see Figure 1).
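To make the counting in the APA example concrete, the following sketch (candidate names are hypothetical placeholders, not the actual 1980 candidates) enumerates the 5! full rankings and the subset of full rankings consistent with a top-k partial ballot:

```python
from itertools import permutations
from math import factorial

# Placeholder names standing in for the five 1980 APA candidates.
candidates = ["A", "B", "C", "D", "E"]

# The number of full rankings grows factorially: 5! = 120.
assert len(list(permutations(candidates))) == factorial(5) == 120

def consistent_with_top_k(prefix, full_ranking):
    """A full ranking is consistent with a top-k partial ranking if it
    begins with the same k candidates in the same order."""
    return tuple(full_ranking[:len(prefix)]) == tuple(prefix)

# A voter who names only a single favorite ("A") leaves 4! = 24 full
# rankings consistent with that partial ballot.
matches = [r for r in permutations(candidates)
           if consistent_with_top_k(("A",), r)]
assert len(matches) == factorial(4) == 24
```

A top-k ballot thus corresponds not to one ranking but to an (n-k)!-sized set of full rankings, which is why conditioning on partial rankings requires more than pointwise evaluation of a model.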
For example, roughly a third of voters simply wrote down their single favorite candidate in this 1980 election.

These three intertwined challenges of storage, algorithmic, and human task complexity are the central issues of probabilistic modeling for rankings, and models that do not efficiently handle all three sources of complexity have limited applicability. In this paper, we examine a flexible and intuitive class of models for rankings based on a generalization of probabilistic independence called riffled independence, proposed in our recent work (Huang & Guestrin, 2009, 2010). While our previous papers have focused primarily on representational (storage complexity) issues, we now concentrate on inference and incomplete observations (i.e., partial rankings), showing that in addition to storage complexity, riffle independent models can efficiently address issues of algorithmic and human task complexity. In fact, the two issues of algorithmic and human task complexity are intricately linked for riffle independent models. By considering partial rankings, we give users more flexibility to provide as much or as little information as they care to give. In the context of partial ranking data, the most relevant inference queries also take the form of partial rankings. For example, we might want to predict a voter's second-choice candidate given information about his first choice. One of our main contributions in this paper is to show that inference for such partial ranking queries can be performed particularly efficiently for riffle independent models.

The main contributions of our work are as follows:²

• We reveal a natural and fundamental connection between riffle independent models and partial rankings. In particular, we show that the collection of partial rankings over an item set forms a complete characterization of the space of observations upon which riffle independent models can be efficiently conditioned.

1. Note that it is common to wonder why one would care to represent a distribution over all rankings if the number of sample rankings is never nearly as large. The fact that the number of samples is always much smaller than n!, however, means that most rankings are never observed, limiting our ability to estimate the probability of an arbitrary ranking. The only way to overcome the paucity of samples is to exploit representational structure, which is very much in alignment with solving the storage complexity issue.
2. This paper is an extended presentation of our paper (Huang, Kapoor, & Guestrin, 2011), which appeared in the 2011 Conference on Uncertainty in Artificial Intelligence (UAI), as well as results from the first author's dissertation (Huang, 2011).
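The generative process behind riffled independence can be sketched as follows. This is a minimal illustration, assuming a two-way split of the item set; the subset names and the uniform placeholder distributions are hypothetical, not the learned factors of the actual model: each subset is ranked internally, and the two internal rankings are then interleaved, the way a riffle card shuffle interleaves two halves of a deck.

```python
import random

def riffle_interleave(rank_a, rank_b, a_positions):
    """Interleave two internal rankings: items of rank_a occupy the global
    positions in a_positions; rank_b fills the rest. Both internal orders
    are preserved, as in a riffle shuffle."""
    it_a, it_b = iter(rank_a), iter(rank_b)
    return [next(it_a) if i in a_positions else next(it_b)
            for i in range(len(rank_a) + len(rank_b))]

def sample_riffle_independent(set_a, set_b, rng):
    """Draw a full ranking from a toy riffle independent model: two
    independent internal rankings plus an independent interleaving
    (all uniform here, purely for illustration)."""
    rank_a = rng.sample(set_a, len(set_a))   # relative ranking of subset A
    rank_b = rng.sample(set_b, len(set_b))   # relative ranking of subset B
    n = len(set_a) + len(set_b)
    a_positions = set(rng.sample(range(n), len(set_a)))  # the interleaving
    return riffle_interleave(rank_a, rank_b, a_positions)

# Deterministic interleaving: subset-A items at global positions 0 and 2.
print(riffle_interleave(["x1", "x2"], ["y1", "y2", "y3"], {0, 2}))
# → ['x1', 'y1', 'x2', 'y2', 'y3']
```

The factored form is what the partial-ranking connection exploits: a top-k partial ranking constrains the interleaving and the two internal rankings separately, so conditioning can operate on the small factors rather than on the n!-sized joint.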
Journal: J. Artif. Intell. Res.
Volume: 44
Pages: -
Publication date: 2012